The Titanic was considered the most luxurious and safest ship of its time, giving rise to the legend that it was "unsinkable".
The voyage that ended in the sinking of the RMS Titanic began in Southampton, United Kingdom, on April 10, 1912, with stops in Cherbourg-Octeville, France, and Queenstown, Ireland. On April 14 the ship struck an iceberg and sank early the next morning, with the loss of roughly 1,500 people.
A curious fact, which may help in our analysis of the data, is that during the evacuation Captain Smith went to two officers (Lightoller and Murdoch) and said, "Put the women and children in and lower away." The two interpreted the order differently: Lightoller understood that only women and children could board, and when no one from that group was left he lowered the boats with seats still empty, while Murdoch allowed men to board after the women and children. Because the crew did not know how many people each boat could hold, several were lowered at half capacity.
No more doubts, let's start analyzing the data!
# Import the libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
import pandas as pd
import numpy as np
#Machine learning
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection, tree, preprocessing
filename = 'titanic_data.csv'
titanic_df = pd.read_csv(filename, header=0)
titanic_df.head()
At this stage we will clean the data, removing anything that may interfere with the analysis and with the predictions. Candidates include null values, columns with many missing entries, columns whose information is irrelevant to the focus of the analysis, and values far above or below the average. The numbers behind each decision to keep or discard data will be shown in this section.
Knowing the columns of the dataset better:
From the questions that have been raised we will assume that the variables that should be investigated further are:
# Check the data types
titanic_df.dtypes
We will check for columns with missing data, because these absences may interfere with the analysis. The treatment of missing data will take into account the percentage (%) of absent values and the relevance of the information to the conclusion. We begin by identifying which columns have missing values, counting the existing values and dividing by the total number of rows in the column.
((len(titanic_df) - titanic_df.count()) / len(titanic_df)) *100
Missing Values: Age
About 20% of the data in the Age column is missing. We know that age was one of the criteria for boarding a lifeboat, so, assuming the column is important, I will not delete it or drop the rows without values. A good strategy here is to assign the column mean to all missing entries.
titanic_df.Age.fillna(0).describe()
Assigning the column mean in all cells with NaN values:
titanic_df.Age = titanic_df.Age.fillna(titanic_df.Age.mean())
Missing Values: Cabin
titanic_df.Cabin.describe()
Missing Values: Embarked
Embarked has very few missing values (0.2%); in such cases it would be reasonable to exclude those records, but the port of embarkation is not relevant from the perspective of this analysis. The whole column will be deleted at the end of this section.
# Embarked NaN values
titanic_df.Embarked.describe()
During the analysis it may be necessary to derive new data from the existing columns. This will be done whenever there is a benefit in seeing the information from another perspective or accessing it more easily.
Derived data: Family
# Create a new column: Family
titanic_df['Family'] = titanic_df['SibSp'] + titanic_df['Parch']
Derived data: Age group (AgeRange)
The AgeRange column will be created to make it easier to analyze the age groups of the survivors. Although I am 32 years old, I will broaden the age ranges so that the analysis is not tied to a single age.
def age_range(idade):
    """
    Returns the age range for a given age.

    Args:
        idade: Number representing the age.

    Returns:
        A string with the age range identified for the given age.
        Domain of values: Elderly, Adult, Young Adult, Teen and Child.
    """
    if idade >= 65:
        return 'Elderly'
    elif idade >= 33:
        return 'Adult'
    elif idade >= 18:
        return 'Young Adult'
    elif idade >= 12:
        return 'Teen'
    else:
        return 'Child'
# Applies the age_range function to each value in the Age column and assigns the result to the new AgeRange column
titanic_df['AgeRange']= titanic_df.Age.apply(age_range)
titanic_df.AgeRange.describe()
My group, "Young Adult", is the most frequent in the data set. See the data distribution:
# Print chart with total passengers by age group.
titanic_df.groupby(['AgeRange']).size().plot()
The last step of the manipulation is to exclude from the dataset the variables considered irrelevant or inadequate. SibSp and Parch will be deleted because their contents are replicated in the Family column.
# Delete columns
titanic_df.drop(['PassengerId','SibSp', 'Parch', 'Cabin','Embarked', 'Ticket','Name'], axis=1, inplace=True)
titanic_df.head()
Statistical summary of the numerical values:
titanic_df.describe()
Comments:
Although the data were treated in the 'Working the data' section, when describing them in the statistical summary I noticed that the data set gives the impression that the passengers were accompanied; also, the lowest fare is 0 and the highest is 512, far from the mean of 32.
We will investigate and treat these variables.
The Family mean is 0.9, giving the impression that most of the passengers were accompanied. Let's confirm this:
((len(titanic_df.Family) - titanic_df[titanic_df['Family']>0].count()) / len(titanic_df.Family)) *100
About 60% of the passengers were unaccompanied. Well, if most of the passengers traveled alone, it is possible that the sum of the Family column is close to the total number of passengers.
# Sum Family
titanic_df.Family.sum()
The sum of family members is indeed close to the total number of rows in the Family column. Looking only at the mean, one gets the impression that most of the passengers were accompanied.
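The effect can be seen with a small illustration (toy numbers, not the Titanic data): in a skewed count column where most values are zero, a few large values pull the mean up while the median stays at zero.

```python
# Toy example: 60% of the values are 0, mirroring the unaccompanied passengers
family_sizes = [0, 0, 0, 0, 0, 0, 1, 2, 5, 7]

mean = sum(family_sizes) / float(len(family_sizes))
median = sorted(family_sizes)[len(family_sizes) // 2]

print(mean)    # 1.5 -> suggests "most passengers travel accompanied"
print(median)  # 0   -> the typical passenger actually travels alone
```

This is why the 0.9 mean is misleading: it describes the few large families, not the typical passenger.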
Checking whether there is any pattern in this data:
# List the unique values of family
titanic_df.Family.unique()
We already know that most of the passengers were unaccompanied, is there a different relationship of this variable with the dead and survivors?
We group the data by Family and Survived to find out whether there is a relation between the number of companions and survival.
acompanhante_sobrevivencia = titanic_df.groupby(['Family','Survived']).size()
acompanhante_sobrevivencia
Viewing the survivor distribution map by number of companions:
sb.heatmap(acompanhante_sobrevivencia.unstack(), annot=True, fmt='g')
plt.xlabel('0 - Died , 1 - Survived')
plt.ylabel('Total companions')
The chart confirms that most of the passengers were unaccompanied. If the analysis were based only on these two columns, ignoring class, sex and fare, being unaccompanied would make a passenger less likely to survive.
We have treated missing, irrelevant and derived data; now we find a new kind of data that needs handling, because discrepant values can significantly influence the final result.
Discrepant data: Fare
Fare has two issues that need to be evaluated. The first is that the maximum ticket price is 512 while the mean is 32; I will dig into this data because such outliers can distort the result. The second is that the minimum value is 0, which could be, for example, children's tickets. This will be verified next.
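As a side note, a common rule of thumb for flagging outliers (not the approach taken below, where the sorted values are simply inspected) is the 1.5 * IQR fence. A sketch with made-up fare values, where 512.3 plays the role of the suspicious maximum:

```python
import numpy as np

# Hypothetical fare values for illustration only
fares = np.array([7.25, 8.05, 13.0, 26.0, 31.0, 71.3, 263.0, 512.3])

q1, q3 = np.percentile(fares, [25, 75])
upper_fence = q3 + 1.5 * (q3 - q1)  # values above this are flagged

print(fares[fares > upper_fence])  # only 512.3 is flagged
```

With this rule only the extreme value stands out, matching the intuition that 512 is far from the rest of the distribution.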
Question 1: Is Fare's greatest value really discrepant?
# Sort the data in descending order to see if there are many high values such as max = 512.
titanic_df.Fare.sort_values(ascending=False).head(30)
The highest fare (512.3292) is very far from the second highest (263.0000), while the other values are fairly close to one another. Unless there were three luxury cabins with an exorbitant fare, which I doubt, this is probably a flaw in the data. For passengers whose ticket equals the highest value, I will assign the second highest value.
# Lists the 10 largest Fare values, but the list has repeated values, then filters
# Only unique (.unique) and returns the second line [1], because the first is the Max.
second_max_fare = titanic_df.Fare.nlargest(10).unique()[1]
second_max_fare
# Assigns the second highest value in the column (second_max_fare)
# Where Fare is equal to the maximum value of the column
titanic_df.Fare = titanic_df.Fare.apply(lambda x: second_max_fare if x==titanic_df.Fare.max() else x)
# Checks whether the max has been changed to the second value.
titanic_df.Fare.max()
Question 2: Are the zero fares those of children or the elderly? Do they have any factor in common?
# Search result where the value of the passage is equal to 0.
titanic_df[titanic_df['Fare']==0]
The theory that these could be free tickets because of age was not confirmed: there are no children or elderly people in the group. What the records have in common is male gender and absence of companions. I believe this is more likely a flaw in the data, so I will assign the mean of the Fare column.
# Replaces fares equal to 0 with the mean of the Fare column
titanic_df.Fare = titanic_df.Fare.apply(lambda x: titanic_df.Fare.mean() if x==0 else x)
# Summary
titanic_df.Fare.describe()
As can be seen, the minimum value (min) became 4.01 and the maximum (max) 263.
What is the correlation of the variables with survival? The Pearson method (.corr) was used to measure the dependency between them. The coefficient varies between -1 and 1, and an absolute value above 0.5 indicates a moderate to very strong correlation.
titanic_df.corr(method='pearson', min_periods=1)
As the Sex column is not numerical, no correlation was calculated for it. I believe it correlates with Survived, so I will convert it to a number, in an additional column, to measure the correlation.
# Assign an integer value to the categorized values of Sex (0 - female, 1 - male)
titanic_df['SexInt'] = (titanic_df.Sex == 'male').astype(int)
# Calls the correlation method by passing the pearson type as a parameter and saves the values in a dataframe.
correlation_df = titanic_df.corr(method='pearson', min_periods=1).abs()
correlation_df
The purpose of this analysis is to score the correlation between the variables in our data set, not to establish a cause of survival or to define statistical values without an associated controlled test. It was important to add SexInt to represent the Sex variable. Sex (SexInt) and Pclass are the most correlated with Survived, which makes sense if we consider that the location of the cabin may have made a difference and that women and children were given priority in the emergency. Pclass and Fare are also correlated, mainly because the cabin type/class are factors in the ticket price.
Classification of the results - correlation coefficient p
correlation_df.unstack().sort_values(ascending=False)
=> Moderate correlation: p > 0.5
=> Weak correlation: p > 0.3
=> Very weak correlation: otherwise
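These thresholds can be captured in a small helper (a convenience sketch, not part of the original notebook), applied to the absolute value of the coefficient:

```python
def correlation_strength(p):
    """Classify an absolute Pearson coefficient with the thresholds above."""
    p = abs(p)
    if p > 0.5:
        return 'Moderate'
    elif p > 0.3:
        return 'Weak'
    return 'Very Weak'

print(correlation_strength(0.54))   # Moderate
print(correlation_strength(-0.34))  # Weak
print(correlation_strength(0.08))   # Very Weak
```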
Data is clean and ready for viewing.
We will begin the exploratory analysis with an overview of all the columns and the relationships in the data. The data are grouped into 0 - Died and 1 - Survived.
# Overview of all variables with the help of pairplot ().
sb.pairplot(titanic_df, hue='Survived', diag_kind='kde', size=2.5, markers=['o','s'], palette=['gray','red'])
This overview helps in choosing which data to select for a more detailed view. For example, SexInt makes it clear that more men died on the Titanic, and Pclass shows a greater volume of deaths in 3rd class.
Through the charts we will try to better understand the content of the Age column and its importance for survival, in addition to its relationship with class (Pclass).
fig = plt.figure(figsize=(18,6), dpi=1600)
# Create subplot1
ax1 = plt.subplot(2,2,1)
# Histogram Column Age
titanic_df.Age.hist(bins=10)
# Label x - Age
plt.xlabel("Age")
# Title
plt.title("Histogram Age, (bin=10)")
# Create subplot2
ax2 = plt.subplot(2,2,2)
# Plot density of column Age
titanic_df['Age'].plot(kind='kde', style='k--')
# Label y
plt.ylabel("Density")
# Label x
plt.xlabel("Age")
# Title
plt.title("Density - Age")
# Create subplot3
ax3 = plt.subplot(2,2,(3,4))
# Plot density - Class
titanic_df.groupby('Pclass').Age.plot.kde()
# Label x
plt.xlabel("Age")
# Title
plt.title("Distribution Age/Class")
# Legend
plt.legend(('1 Class', '2 Class','3 Class'),loc='best')
fig = plt.figure(figsize=(18,6), dpi=1600)
# Create subplot1
ax1 = plt.subplot(1,2,1)
# Plot density - Pclass
titanic_df.groupby('Pclass').Survived.plot.kde()
# Label x and y
plt.xlabel("0 - Died 1 - Survived")
plt.ylabel("Density")
# Title
plt.title("Distribution Survived by Class")
# Legend
plt.legend(('1 Class', '2 Class','3 Class'),loc='best')
# Create subplot2
ax2 = plt.subplot(1,2,2)
# Plot Count Male and Female
titanic_df.groupby('Sex').count()['Survived'].plot.bar()
plt.xlabel("Female - Male")
# Title
plt.title("Number of female and male")
The distribution of deaths and survivors by class leaves no doubt about the relationship between these variables: the highest volume of deaths is in 3rd class and of survivors in 1st, while 2nd class shows only a small variation. The relationship between class and survival was one of the questions raised at the start of this work. It is not possible to say that being in a certain class caused survival; the data only show that most of the survivors were in first class.
In the data set the majority of passengers were men.
# Create survived
sobreviventes = titanic_df[titanic_df['Survived']==1]
sb.factorplot(x="Sex", y="Age", hue="Pclass",
col="Pclass", data=sobreviventes, kind="box", size=4, aspect=.5)
# Label x and y
plt.xlabel("Sex")
plt.ylabel("Age")
Now that we know the data better and begin to identify the characteristics most common to the survivors, we will try to answer the questions that motivated this analysis with a more detailed investigation.
Question 1: Knowing that there was this misunderstanding in the distribution of passengers among the boats, does the titanic_data.csv data source give us any information that reflects this misunderstanding among the officers?
We know so far that most of the passengers were men, that Young Adult was the most common age group and that most of the dead were in 3rd class. The Titanic's officers prioritized children and women, so it is to be expected that men have a lower survival rate.
No detailed analysis will be done on the Age column itself; we will use the derived AgeRange column, which contains the passenger's age group.
# Plot violinplot with survivor distribution by sex and age group
ax = sb.violinplot(data=titanic_df, x='SexInt', y='Survived', hue='AgeRange')
ax.set(xlabel='(0) Women , (1) Men', ylabel='(0) Died, (1) Survived')
The chart above makes it very clear that, at least in this sample, the majority of women survived regardless of age, while most of the men died, including the elderly. For both sexes, the volume of surviving children is similar to that of the dead. Next, the averages by age group and sex.
Question 2: Were women and children more likely to survive?
Let's start with the survival rate of men and women regardless of age.
# Group by Sex
titanic_df.groupby(['Sex']).mean()
In the data set 74% of the women and 19% of the men survived. Good news for me, but will the survival percentage remain favorable once I add the age range?
# Group by Age Range and Sex
titanic_df.groupby(['AgeRange','Sex']).mean()
In the group of young adult women the survival rate is 71%, well above the 16% of men in this age group. With the exception of children, men in all age groups had a low survival rate. Children, regardless of sex, had similar rates: 59% for girls and 56% for boys. This reflects the violinplot well, where we can see a fairly even split between dead and surviving children.
Going deeper into the analysis, adding the age group reduced my group's chance of survival by 3 percentage points, from 74% to 71%. Among the figures calculated for women this is one of the lowest, losing only to the children, but I am still happy, because there is one more element left to analyze: the class.
Question 3: Was the class an important factor for survival?
To find out whether survival was more likely in certain classes than in others, let's look at the survival rate by class alone, without interference from other variables such as Sex or AgeRange.
# Mean of group by Pclass
titanic_df.groupby(['Pclass']).mean()
This grouping confirms what we saw in the charts: the survival rate is significantly lower for 3rd class. On its own this does not mean much, because we know that women and children had better chances. Let's look at the distribution with Sex and AgeRange added.
# Group data by age group, gender, class
faixa_etaria_genero = titanic_df.groupby(['AgeRange','Sex','Pclass']).mean()
faixa_etaria_genero
According to our sample, things looked grim for adult and elderly men in 3rd class: their survival rate is below 4%.
Question 4: And the question I ask is: if I had been there, a young adult woman in 2nd class, is it likely that I would have survived?
To answer this question, based on what we know so far, let's zoom in on the specific group:
jovem_adulta = titanic_df.groupby(['AgeRange','Sex','Pclass']).mean().T
jovem_adulta
The data we have is only a sample of the Titanic's passengers: some data were missing, some were excluded, and a good part of the information was filled in with assumed values. For this reason, the rate presented here is only valid as an experiment in this context; it does not mean that a young adult woman in 2nd class would have survived.
From now on, let's make the analysis more interesting by adding a bit of machine learning, training a model on the Titanic data sample. Once "trained", we will test the prediction for a young adult woman in 2nd class. Will we get a different result from the exploratory analysis?
Preparing the data
titanic_df.head()
# Create a copy of titanic_df
processed_df = titanic_df.copy()
# Delete SexInt column
processed_df.drop('SexInt',axis=1, inplace=True)
processed_df.head()
The data in the AgeRange and Sex columns are categorical and will be transformed, with the help of LabelEncoder, into numeric values that the model can read more easily.
le = preprocessing.LabelEncoder()
# Sex and AgeRange columns receive their numerical version
processed_df.Sex = le.fit_transform(processed_df.Sex)
processed_df.AgeRange = le.fit_transform(processed_df.AgeRange)
# Check the result
processed_df.head()
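One caveat worth knowing: LabelEncoder assigns integers in alphabetical order of the labels, so the AgeRange codes do not follow the natural age order. A quick check on the category names used here:

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
codes = le.fit_transform(['Young Adult', 'Child', 'Elderly', 'Adult', 'Teen'])

# Classes are sorted alphabetically, not by age
print(list(le.classes_))  # ['Adult', 'Child', 'Elderly', 'Teen', 'Young Adult']
print(list(codes))        # [4, 1, 2, 0, 3]
```

Tree-based models do not care about this ordering, but it matters when reading the encoded values by hand.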
# X Receives all values from the dataset minus the Survived column that will be used in the comparison
X = processed_df.drop(['Survived'], axis=1).values
# y Receives the values from the Survived column that will be used by the model as a comparison
y = processed_df['Survived'].values
Now that the two data sets have been created (X containing all columns except Survived, and y containing only the Survived values, the expected result), the data will be split into training and test sets. The model will be trained by the classification algorithm using X_train and y_train; once it is ready, X_test and y_test will be used to evaluate it.
# Divide the matrices into test and training
X_train, X_test, y_train, y_test = model_selection.train_test_split(X,y,test_size=0.2)
Training the model:
# Create Decision Tree
clf_dt = tree.DecisionTreeClassifier(max_depth=5)
# Train model
clf_dt.fit(X_train, y_train)
Mean accuracy on the test data:
# Check the accuracy of the model
clf_dt.score(X_test, y_test)
The decision tree model predicted survival correctly for about 81% of the test data.
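A single train/test split can be lucky or unlucky; cross-validation averages the score over several splits and gives a more stable estimate. A sketch, using a synthetic dataset as a stand-in (on the real data you would pass X and y instead of the placeholder X_demo and y_demo):

```python
from sklearn import model_selection, tree
from sklearn.datasets import make_classification

# Synthetic stand-in for the Titanic features
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)

clf = tree.DecisionTreeClassifier(max_depth=5)
# 5-fold cross-validation: five different train/test partitions
scores = model_selection.cross_val_score(clf, X_demo, y_demo, cv=5)

print(scores.mean(), scores.std())
```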
Simulated data
Based on what we have already verified, we will test the model with simulated data for the main passenger (young adult woman, 2nd class); we will vary the class for this same passenger and include two more passengers belonging to low-survival groups. Will the result be similar to the exploratory analysis?
The set we will test:
| Pclass | Sex | Age | Fare | Family | AgeRange |
|---|---|---|---|---|---|
| 1 | Female | 32 | 109 | 0 | Young Adult |
| 2 | Female | 32 | 21 | 0 | Young Adult |
| 3 | Female | 32 | 14 | 0 | Young Adult |
| 3 | Male | 80 | 23 | 5 | Elderly |
| 2 | Male | 32 | 8 | 0 | Young Adult |
# Create dataset with simulated passenger information
# Features: Pclass, Sex, Age, Fare, Family, AgeRange
passageiros_simulados = [[1, 0, 32, 109, 0, 4],
[2, 0, 32, 21, 0, 4],
[3, 0, 32, 14, 0, 4],
[3, 1, 80, 23, 5, 3],
[2, 1, 32, 8, 0, 4]]
# Prediction for the simulated data set
clf_dt.predict(passageiros_simulados)
Remembering: 0 - Died, 1 - Survived
How interesting! The Decision Tree model predicts that a young adult woman in first, second or third class would survive. To test the model's "competence", I changed the sex of the main passenger, since in the analysis phase men had a lower survival rate, and, as expected, the model predicted death for him. Cruel, but within expectation. With no surprise at all, death was also predicted for our 80-year-old passenger.
Generating the .dot file with the Decision Tree model decisions:
# Generates the GraphViz representation of the decision tree. The data is recorded in the file titanic_tree.dot
#Data can be viewed graphically at http://www.webgraphviz.com/
tree.export_graphviz(clf_dt, out_file='titanic_tree.dot', feature_names=processed_df.columns[1:])
The file 'titanic_tree.dot' is in github: https://github.com/liebycardoso/Intro_Data_Analysis
In order to facilitate visualization, the GraphViz representation of the decision tree with max_depth = 5 was generated.

Using the Random Forest model
DecisionTree is a simple single-tree model; I will also train the data with a Random Forest, which decides using multiple trees and returns the most common result among them.
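The voting can be verified directly: each fitted tree in `estimators_` makes its own prediction and the forest returns the majority class. A small sketch on synthetic data (X_demo, y_demo and sample are placeholders, not columns of this notebook):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Toy data standing in for the Titanic features
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

sample = X_demo[:1]
# Collect the vote of every individual tree
votes = [int(t.predict(sample)[0]) for t in forest.estimators_]
majority = int(np.bincount(votes).argmax())

# The forest's answer matches the majority of its trees
print(majority == int(forest.predict(sample)[0]))
```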
# Create Random Forest
clf_rf = RandomForestClassifier(n_estimators=100, oob_score=True)
# Train model the same way we did with DecisionTreeClassifier
clf_rf.fit(X_train, y_train)
Predicting the result
# Predict result
Y_pred = clf_rf.predict(X_test)
# Check accuracy
clf_rf.score(X_train, y_train)
In this case it is possible to predict survival with 98% accuracy.
# Out-of-bag (oob) score estimate: about 81%
clf_rf.oob_score_
We will test the same data set used with the Decision Tree to predict the result with the Random Forest, so that we can compare the two.
# Prediction for the simulated data set
clf_rf.predict(passageiros_simulados)
Comparison between the result of the Decision Tree and Random Forest:
| Pclass | Sex | Age | Fare | Family | AgeRange | Decision Tree | Random Forest |
|---|---|---|---|---|---|---|---|
| 1 | Female | 32 | 109 | 0 | Young Adult | Survived | Survived |
| 2 | Female | 32 | 21 | 0 | Young Adult | Survived | Survived |
| 3 | Female | 32 | 14 | 0 | Young Adult | Survived | Survived |
| 3 | Male | 80 | 23 | 5 | Elderly | Died | Died |
| 2 | Male | 32 | 8 | 0 | Young Adult | Died | Died |
The two models (Random Forest and Decision Tree) returned the same result for the data set. In the simulation I even changed the young adult passenger's class to 3rd: during the exploratory analysis we saw that the survival rate for young adult women in 3rd class was 52%, so either death or survival was a plausible result for this record, and in this case survival was predicted. The 1st class rate was 97% and the 2nd class rate was 92%. This result is in line with the conclusions from the data analysis: a young adult woman, whether in 1st or 2nd class, had a good probability of survival in both the models and the exploratory analysis. For men the result also agrees with the analysis, since their survival rate was shown to be low throughout.
The importance of each column in achieving the result
Below, the features used in the model are listed in order of their importance for the final result.
feat_importance = pd.Series(clf_rf.feature_importances_, index=processed_df.drop(['Survived'], axis=1).columns)
feat_importance.sort_values(ascending=False)
Chart with the importance of each feature:
plt.barh(np.arange(len(feat_importance)), feat_importance, alpha=0.7)
plt.yticks(np.arange(.5,len(feat_importance),1), feat_importance.index)
plt.xlabel('Importance')
plt.ylabel('Variable')
plt.title('Importance of each variable')
The importance hierarchy of the variables is very close to that obtained with the Pearson correlation, Fare, Sex and Age being the most important.
A dataframe with the prediction (Y_pred) and the X_test values will be created, to simulate the result on the Titanic data set.
# Create dataframe
predicao_df = pd.DataFrame(X_test, columns=['Pclass','Sex', 'Age', 'Fare', 'Family','AgeRange'])
predicao_df['Predict'] = Y_pred
predicao_df['Survived'] = y_test
predicao_df.head()
Checking the values for the new Predict column:
predicao_df.groupby(['Predict']).mean()
The same grouping, only using titanic_df as dataset:
titanic_df.groupby(['Survived']).mean()
As you can see, the figures are quite similar.
To make the analysis easier, I will re-plot Sex / Age / Pclass for the surviving passengers: first from the original titanic_df, then from predicao_df.
# Plot the boxplot of age by sex and class in the original dataframe titanic_df
sb.factorplot(x="Sex", y="Age", hue="Pclass",
col="Pclass", data=sobreviventes, kind="box", size=4, aspect=.5)
plt.title("Dados titanic_df")
# Plot the boxplot of age by sex and class in the dataframe predicao_df
predicao_sobrevivente = predicao_df[predicao_df['Predict']==1]
sb.factorplot(x="Sex", y="Age", hue="Pclass",
col="Pclass", data=predicao_sobrevivente, kind="box", size=4, aspect=.5)
plt.title("Dados predicao_df")
# Mean group by Sex, Class Predict
predicao_df.groupby(['Sex','Pclass','Predict']).mean()['Age']
# Total survivors by Sex / Pclass
predicao_df.groupby(['Sex','Pclass','Predict']).count()['Age'].unstack()
Analyzing the table above we can see that the model predicted survival for all 1st and 2nd Class passengers.
After investigating the Titanic data set we found that women had a higher survival rate than men, and that first and second class passengers also had better rates. In this scenario we investigated whether a young adult woman in 2nd class would survive, and we concluded, based on the data and the predictions of the Decision Tree and Random Forest models, that yes, that passenger would have had a good chance of surviving.
The observations, findings and results do not represent the reality of the facts, because we are working with only an incomplete sample of the data.
The sample has information on 891 passengers, while it is known that about 2,223 people were aboard the Titanic; the actual death toll alone is greater than the sample size.
Another limitation of the analysis is the values that were assumed in place of the missing ones, which possibly affected the final result. Some information created to assist the analysis:
1) Age: The null values were replaced by the overall mean of the Age column.
2) Fare: The highest ticket value was far from the second highest, so it was replaced by the second highest value. The zero values were replaced by the overall mean of the Fare column.
The Cabin column was excluded from the analysis because it has many null values, but its values suggest a pattern that could be investigated in a new analysis. The same goes for the passengers' titles (Miss, Mrs, Mr, Master, Major, etc.): could we establish a pattern for this column too, and would it help predict the outcome? These are some questions that were not answered here and that, by remaining hidden, may have interfered with the final result.
Although the supervised Random Forest model fit this data well, another model, such as a support vector machine (SVM), could be tested by anyone interested in extending this research.
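A minimal sketch of what that could look like, again with a synthetic stand-in for X and y (SVMs are sensitive to feature scale, so the features are standardized first; X_demo and y_demo are placeholders):

```python
from sklearn import svm, preprocessing, model_selection
from sklearn.datasets import make_classification

# Synthetic stand-in for the processed Titanic features
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Standardize using statistics from the training set only
scaler = preprocessing.StandardScaler().fit(X_tr)

clf_svm = svm.SVC(kernel='rbf', C=1.0)
clf_svm.fit(scaler.transform(X_tr), y_tr)

accuracy = clf_svm.score(scaler.transform(X_te), y_te)
print(accuracy)
```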
A very important factor is that the data refer to human beings, whose behavior and decisions in a moment of danger depend on many variables unknown to this analysis. For example, survivors reported how hard it was to convince some passengers to board the lifeboats; I imagine some women found it difficult to leave their husbands and older sons, and perhaps did not leave at all. The misinterpretation of the captain's orders meant that men in one part of the ship were allowed to board while men on the other side were not, which contributed to lowering the male survival rate.
Because of the many variables involved and the missing information, we can have fun investigating the data, but we cannot attribute statistical value to this work.